Minimum description length methods of medium-scale simultaneous inference

Author

  • David R. Bickel
Abstract

Nonparametric statistical methods developed for analyzing data for high numbers of genes, SNPs, or other biological features tend to overfit data with smaller numbers of features such as proteins, metabolites, or, when expression is measured with conventional instruments, genes. For this medium-scale inference problem, the minimum description length (MDL) framework quantifies the amount of information in the data supporting a null or alternative hypothesis for each feature in terms of parametric model selection. Two new MDL techniques are proposed. First, using test statistics that are highly informative about the parameter of interest, the data are reduced to a single statistic per feature. This simplifying step is already implicit in conventional hypothesis testing and has been found effective in empirical Bayes applications to genomics data. Second, the codelength difference between the alternative and null hypotheses of any given feature can take advantage of information in the measurements from all other features by using those measurements to find the overall code of minimum length summed over those features. The techniques are applied to protein abundance data, demonstrating that a computationally efficient approximation that is close for a sufficiently large number of features works well even when the number of features is as low as 20. More generally, the MDL-based information for discrimination does not suffer from the asymmetry of the p-value as a measure of evidence for one hypothesis over another.
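The two techniques in the abstract can be illustrated with a minimal sketch. It is not the paper's actual estimator: it assumes each feature has already been reduced to one z-like statistic, that the null model is N(0, 1), and that the alternative is N(theta, 1), with theta chosen to minimize the codelength summed over all the other features (which, under these assumptions, is simply their mean). The function names are hypothetical.

```python
import math


def normal_codelength_bits(z, mean):
    """Ideal codelength, in bits, of z under a unit-variance normal density."""
    nll_nats = 0.5 * math.log(2.0 * math.pi) + 0.5 * (z - mean) ** 2
    return nll_nats / math.log(2.0)


def information_for_alternative(stats):
    """Per-feature codelength difference (null minus alternative), in bits.

    Assumption for illustration only: the alternative mean for feature i
    is the value minimizing the total codelength of the other features,
    i.e. the mean of their statistics. Positive values favor the
    alternative hypothesis; negative values favor the null.
    """
    n = len(stats)
    if n < 2:
        raise ValueError("need at least two features to borrow information")
    total = sum(stats)
    info = []
    for z in stats:
        theta = (total - z) / (n - 1)  # leave-one-out codelength minimizer
        null_bits = normal_codelength_bits(z, 0.0)
        alt_bits = normal_codelength_bits(z, theta)
        info.append(null_bits - alt_bits)
    return info


if __name__ == "__main__":
    # Hypothetical statistics for five features.
    example_stats = [2.1, 1.8, -0.2, 2.5, 0.1]
    for z, bits in zip(example_stats, information_for_alternative(example_stats)):
        print(f"z = {z:+.1f}  information for alternative = {bits:+.2f} bits")
```

Because the result is a signed difference of two codelengths, swapping the roles of the null and alternative hypotheses only flips its sign, which is the symmetry the abstract contrasts with the p-value.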

Similar resources

Introduction to Minimum Encoding Inference

This paper examines the minimum encoding approaches to inference, Minimum Message Length (MML) and Minimum Description Length (MDL). It was written to provide an introduction to this area for statisticians. We describe coding techniques for data and examine how these techniques can be applied to perform inference and model selection.

Ideal MDL and Its Relation to Bayesianism

Statistics-based inference methods like minimum message length (MML) and minimum description length (MDL) are widely applied approaches. They are the tools to use with particular machine learning praxis such as simulated annealing, genetic algorithms, genetic programming, artificial neural networks, and the like. These methods select the hypothesis which minimizes the sum of the length of the d...

Inference of Phrase-Based Translation Models via Minimum Description Length

We present an unsupervised inference procedure for phrase-based translation models based on the minimum description length principle. In comparison to current inference techniques that rely on long pipelines of training heuristics, this procedure represents a theoretically well-founded approach to directly infer phrase lexicons. Empirical results show that the proposed inference procedure has th...

Journal:
  • CoRR

Volume abs/1009.5981

Publication date: 2010